
    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
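
    The thesis text here gives no reference code, so the following is a minimal sketch of Tikhonov-regularized spectrum decomposition under stated assumptions: a fixed basis of spectral templates (e.g. pitch candidates) and an illustrative regularization weight; the function name and the value of lam are ours, not the paper's.

        import numpy as np

        def tikhonov_decompose(x, basis, lam=0.1):
            # Ridge (Tikhonov) solution of min_g ||x - basis @ g||^2 + lam * ||g||^2:
            # g = (B^T B + lam * I)^{-1} B^T x, available in closed form.
            B = basis
            gram = B.T @ B + lam * np.eye(B.shape[1])
            return np.linalg.solve(gram, B.T @ x)

    Since the regularized Gram matrix depends only on the basis, it can be factorized once offline, leaving a single solve per incoming spectral frame, which is what makes this attractive in the low-latency setting.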

    DNN Driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

    The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audio-visual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverage the complementary strengths of a stacked long short-term memory (LSTM) network and a convolutional LSTM network. The comparative simulation results in terms of speech quality and intelligibility demonstrate significant performance improvement of our proposed AV mask estimation model as compared to audio-only and visual-only mask estimation approaches for both speaker-dependent and speaker-independent scenarios.
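
    The released model is not reproduced here; as a hedged sketch of the training target the abstract names, an ideal binary mask can be derived from parallel clean-speech and noise magnitude spectrograms. The function name and the 0 dB local criterion are assumptions.

        import numpy as np

        def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0, eps=1e-8):
            # IBM = 1 in time-frequency bins where the local speech-to-noise
            # ratio exceeds the local criterion (here 0 dB), else 0.
            snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
            return (snr_db > lc_db).astype(np.float32)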

    The third 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

    The CHiME challenge series aims to advance far-field speech recognition technology by promoting research at the interface of signal processing and automatic speech recognition. This paper presents the design and outcomes of the 3rd CHiME Challenge, which targets the performance of automatic speech recognition in a real-world, commercially motivated scenario: a person talking to a tablet device that has been fitted with a six-channel microphone array. The paper describes the data collection, the task definition and the baseline systems for data simulation, enhancement and recognition. The paper then presents an overview of the 26 systems that were submitted to the challenge, focusing on the strategies that proved to be most successful relative to the MVDR array processing and DNN acoustic modeling reference system. Challenge findings related to the role of simulated data in system training and evaluation are discussed.
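
    For context on the reference enhancement front end, a generic per-frequency-bin MVDR beamformer follows the standard closed form w = R_n^{-1} d / (d^H R_n^{-1} d); this sketch is not the challenge baseline code, and the function and argument names are ours.

        import numpy as np

        def mvdr_weights(noise_cov, steering):
            # noise_cov: (n_mics, n_mics) noise covariance at one frequency bin.
            # steering:  (n_mics,) steering vector toward the target talker.
            r_inv_d = np.linalg.solve(noise_cov, steering)
            return r_inv_d / (steering.conj() @ r_inv_d)

    Applied bin by bin, y(f) = w(f)^H x(f) gives the enhanced single-channel signal passed on to the recognizer.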

    The CHiME challenges: Robust speech recognition in everyday environments

    The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the CHiME series, including a description of the datasets that have been collected and the tasks that have been defined for each edition. In particular, the chapter describes novel approaches that have been developed for producing simulated data for system training and evaluation, and conclusions about the validity of using simulated data for robust speech recognition development. We also provide a brief overview of the systems and specific techniques that have proved successful for each task. These systems have demonstrated the remarkable robustness that can be achieved through a combination of training data simulation and multi-condition training, well-engineered multichannel enhancement, and state-of-the-art discriminative acoustic and language modelling techniques.

    Efficient artifacts filter by density-based clustering in long term 3D whale passive acoustic monitoring with five hydrophones fixed under an Autonomous Surface Vehicle

    Passive underwater acoustics allows for the monitoring of the echolocation clicks of cetaceans. Static hydrophone arrays monitor from a fixed location; however, they cannot track animals over long distances. More flexibility can be achieved by mounting hydrophones on a mobile structure. In this paper, we present the design of a small non-uniform array of five hydrophones mounted directly under the Autonomous Surface Vehicle (ASV) Sphyrna (also called an Autonomous Laboratory Vehicle) built by SeaProven in France. This configuration is made challenging by the 40 cm aperture of the hydrophone array, extending only two meters below the surface and above the thermocline, and thus subject to various artifacts. The array, fixed under the keel of the drone, is numerically stabilized in yaw and roll using the drone's Motion Processing Unit (MPU). To increase the accuracy of the 3D tracking computed from a four-hour recording of a sperm whale diving several kilometers away, we propose an efficient joint filtering of the clicks in the Time Delay of Arrival (TDoA) space. We show how the DBSCAN algorithm efficiently removes outlier detections among the thousands of transients and yields coherent high-definition 3D tracks.
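
    A minimal sketch of the density-based filtering step, assuming each click is summarized by its vector of pairwise time delays; the eps and min_samples values are placeholders rather than the paper's tuned parameters.

        import numpy as np
        from sklearn.cluster import DBSCAN

        def filter_clicks(tdoa_vectors, eps=1e-4, min_samples=10):
            # tdoa_vectors: (n_clicks, n_pairs) delays between hydrophone pairs.
            # DBSCAN marks sparse, isolated detections as noise (label -1);
            # dense, slowly drifting click tracks survive as clusters.
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(tdoa_vectors)
            return tdoa_vectors[labels != -1]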

    High-frequency Near-field Physeter macrocephalus Monitoring by Stereo-Autoencoder and 3D Model of Sonar Organ

    Passive acoustics allows us to study large animals and obtain information that could not be gathered through other methods. In this paper we study a set of near-field audiovisual recordings of a sperm whale pod, acquired with an ultra-high-frequency, small-aperture antenna. We propose a novel kind of autoencoder, a Stereo-Autoencoder, and show how it allows us to build acoustic manifolds in order to increase our knowledge regarding the characterization of their vocalizations and possible individual acoustic signatures.
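
    The abstract does not specify the architecture, so the PyTorch sketch below only illustrates the general idea of a two-channel autoencoder with one shared latent code; the class name, layer sizes and flattened-spectrum input format are all assumptions.

        import torch
        import torch.nn as nn

        class StereoAutoencoder(nn.Module):
            # Encodes a two-channel click spectrum into a shared latent code,
            # then reconstructs both channels from that code.
            def __init__(self, n_bins=256, latent_dim=16):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Linear(2 * n_bins, 128), nn.ReLU(),
                    nn.Linear(128, latent_dim))
                self.decoder = nn.Sequential(
                    nn.Linear(latent_dim, 128), nn.ReLU(),
                    nn.Linear(128, 2 * n_bins))

            def forward(self, x):                 # x: (batch, 2, n_bins)
                z = self.encoder(x.flatten(1))
                return self.decoder(z).view_as(x), z

    The latent codes z, rather than the reconstructions, are the object of interest: embedding and clustering them is one way to build the acoustic manifolds the paper describes.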

    Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw

    We present a number of low-resource approaches to the tasks of the Zero Resource Speech Challenge 2021. We build on the unsupervised representations of speech proposed by the organizers as a baseline, derived from CPC and clustered with the k-means algorithm. We demonstrate that simple methods of refining those representations can narrow the gap, or even improve upon the solutions which use a high computational budget. The results lead to the conclusion that the CPC-derived representations are still too noisy for training language models, but stable enough for simpler forms of pattern matching and retrieval. (Published in Interspeech 2021.)
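
    As a rough illustration of the baseline pipeline the submission builds on (CPC frame features quantized with k-means into discrete units), here is a generic sketch; the cluster count, file name and variable names are assumptions, not the challenge configuration.

        import numpy as np
        from sklearn.cluster import KMeans

        # Hypothetical dump of CPC representations: (n_frames, feat_dim).
        cpc_frames = np.load("cpc_features.npy")
        kmeans = KMeans(n_clusters=50, n_init=10).fit(cpc_frames)
        units = kmeans.predict(cpc_frames)   # one discrete pseudo-unit per frame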

    Aligned Contrastive Predictive Coding

    We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations. Rather than producing individual predictions for each of the future representations, the model emits a sequence of predictions shorter than that of the upcoming representations to which they will be aligned. In this way, the prediction network solves a simpler task of predicting the next symbols, but not their exact timing, while the encoding network is trained to produce piecewise constant latent codes. We evaluate the model on a speech coding task and demonstrate that the proposed Aligned Contrastive Predictive Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX error rates, while being slightly faster to train due to the reduced number of prediction heads. (Published in Interspeech 2021.)
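
    The alignment is the heart of the method; below is a hedged NumPy sketch of one way to monotonically assign M future frames to K <= M predictions by dynamic programming, maximizing total similarity. This is our reading of the idea, not the authors' implementation, and the function name is invented.

        import numpy as np

        def monotonic_align(sim):
            # sim[k, m]: similarity of prediction k to future frame m (K <= M).
            # Each frame gets one prediction, indices never decrease over time,
            # and every prediction is used, yielding piecewise-constant segments.
            K, M = sim.shape
            best = np.full((K, M), -np.inf)
            best[0, 0] = sim[0, 0]
            for m in range(1, M):
                best[0, m] = best[0, m - 1] + sim[0, m]
                for k in range(1, K):
                    best[k, m] = max(best[k, m - 1], best[k - 1, m - 1]) + sim[k, m]
            # Backtrack from (K-1, M-1) to recover the frame-to-prediction map.
            assign = np.empty(M, dtype=int)
            k = K - 1
            for m in range(M - 1, -1, -1):
                assign[m] = k
                if m > 0 and k > 0 and best[k - 1, m - 1] >= best[k, m - 1]:
                    k -= 1
            return best[K - 1, M - 1], assign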

    Remote Speech Technology for Speech Professionals: The CloudCAST Initiative

    Clinical applications of speech technology face two challenges. The first is data sparsity: there is little data available to underpin techniques which are based on machine learning, and, because it is difficult to collect disordered speech corpora, the only way to address this problem is by pooling what is produced from systems which are already in use. The second is personalisation: this field demands individual solutions, technology which adapts to its user rather than demanding that the user adapt to it. Here we introduce a project, CloudCAST, which addresses these two problems by making remote, adaptive technology available to professionals who work with speech: therapists, educators and clinicians. Index Terms: assistive technology, clinical applications of speech technology